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Abstract 

Multiple hypothesis testing is a significant problem in nearly all neuroimaging 
studies. In order to correct for this phenomena, we require a reliable estimate 
of the Family-Wise Error Rate (FWER). The well known Bonferroni correction 
method, while simple to implement, is quite conservative, and can substantially 
under-power a study because it ignores dependencies between test statistics. Per¬ 
mutation testing, on the other hand, is an exact, non-parametric method of es¬ 
timating the FWER for a given a-threshold, but for acceptably low thresholds 
the computational burden can be prohibitive. In this paper, we show that permu¬ 
tation testing in fact amounts to populating the columns of a very large matrix 
P. By analyzing the spectrum of this matrix, under certain conditions, we see 
that P has a low-rank plus a low-variance residual decomposition which makes 
it suitable for highly sub-sampled — on the order of 0.5% — matrix comple¬ 
tion methods. Based on this observation, we propose a novel permutation testing 
methodology which offers a large speedup, without sacrificing the fidelity of the 
estimated FWER. Our evaluations on four different neuroimaging datasets show 
that a computational speedup factor of roughly 50 x can be achieved while recov¬ 
ering the FWER distribution up to very high accuracy. Further, we show that the 
estimated a-threshold is also recovered faithfully, and is stable. 


1 Introduction 

Suppose we have completed a placebo-controlled clinical trial of a promising new drug for a neu- 
rodegenerative disorder such as Alzheimer’s disease (AD) on a small sized cohort. The study is 
designed such that in addition to assessing improvements in standard cognitive outcomes (e.g., 
MMSE), the purported treatment effects will also be assessed using Neuroimaging data. The ra¬ 
tionale here is that, even if the drug does induce variations in cognitive symptoms, the brain changes 
are observable much earlier in the imaging data. On the imaging front, this analysis checks for 
statistically significant differences between brain images of subjects assigned to the two trial arms: 
treatment and placebo. Alternatively, consider a second scenario where we have completed a neu¬ 
roimaging research study of a particular controlled factor, such as genotype, and the interest is to 
evaluate group-wise differences in the brain images: to identify which regions are affected as a 
function of class membership. In either cases, the standard image processing workflow yields for 
each subject a 3-D image (or voxel-wise “map”). Depending on the image modality acquired, these 
maps are of cerebral gray matter density, longitudinal deformation (local growth or contraction) or 
metabolism. It is assumed that these maps have been ‘co-registered’ across different subjects so that 
each voxel corresponds to approximately the same anatomical location. 1213 . 

In order to localize the effect under investigation (i.e., treatment or genotype), we then have to 
calculate a very large number (say, v) of univariate voxel-wise statistics - typically up to several 
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million voxels. For example, consider group-contrast ^statistics (here we will mainly consider t- 
statistics, however other test statistics are also applicable, such as the F statistic used in ANOVA 
testing, Pearson’s correlation as used in functional imaging studies, or the \ 2 test of dependence 
between variates, so long as certain conditions described in Section[23]are satisfied). In some voxels, 
it may turn out that a group-level effect has been indicated, but it is not clear right away what its true 
significance level should be, if any. As one might expect, given the number of hypotheses tests v, 
multiple testing issues in this setting are quite severe, making it difficult to assess the true Family- 
Wise Type I Error Rate (FWER) 0. If we were to address this issue via Bonferroni correction 
0, the enormous number of separate tests implies that certain weaker signals will almost certainly 
never be detected, even if they are real. This directly affects studies of neurodegenerative disorders 
in which atrophy proceeds at a very slow rate and the therapeutic effects of a drug is likely to be mild 
to moderate anyway. This is a critical bottleneck which makes localizing real, albeit slight, short¬ 
term treatment effects problematic. Already, this restriction will prevent us from using a smaller 
sized study (fewer subjects), increasing the cost of pharmaceutical research. In the worst case, an 
otherwise real treatment effect of a drug may not survive correction, and the trial may be deemed a 
failure. 

Bonferroni versus true FWER threshold. Observe that theoretically, there is a case in which the 
Bonferroni corrected threshold is close to the true FWER threshold: when point-wise statistics are 
i.i.d. If so, then the extremely low Bonferroni corrected a-threshold crossings effectively become 
mutually exclusive, which makes the Union Bound (on which Bonferroni correction is based) nearly 
tight. However, when variables are highly dependent - and indeed even without smoothing there are 
many sources of strong non-Gaussian dependencies between voxels, the true FWER threshold can 
be much more relaxed, and it is precisely this phenomenon which drives the search for alternatives to 
Bonferroni correction. Thus, many methods have been developed to more accurately and efficiently 
estimate or approximate the FWER 0[6j0IHl, which is a subject of much interest in statistics (3, 
machine learning flOl . bioinformatics ifTTI . and neuroimaging fl 21 . 

Permutation testing. A commonly used method of directly and non-parametrically estimating the 
FWER is Permutation testing nans. which is a method of sampling from the Global (i.e., Family- 
Wise) Null distribution. Permutation testing ensures that any relevant dependencies present in the 
data carry through to the test statistics, giving an unbiased estimator of the FWER. If we want to 
choose a threshold sufficient to exclude all spurious results with probability 1 — a, we can construct 
a histogram of sample maxima taken from permutation samples, and choose a threshold giving the 
1 — a/2 quantile. Unfortunately, reliable FWER estimates derived via permutation testing come 
at excessive (and often infeasible) computational cost - often tens of thousands or even millions of 
permutation samples are required, each of which requires a complete pass over the entire data set. 
This step alone can run from a few days up to many weeks and even longer HUE). 

Observe that the very same dependencies between voxels, that forced the usage of permutation 
testing, indicate that the overwhelming majority of work in computing so many highly correlated 
Null statistics is redundant. Note that regardless of their description, strong dependencies of almost 
any kind will tend to concentrate most of their co-variation into a low-rank subspace, leaving a 
high-rank, low-variance residual j5j. In fact, for Genome wide Association studies (GWAS), many 
strategies calculate the ‘effective number’ (M e g) of independent tests corresponding to the rank of 
this subspace mm. This paper is based on the observation that such a low-rank structure must also 
appear in permutation test samples. Using ideas from online low-rank matrix completion na we can 
sample a few of the Null statistics and reconstruct the remainder as long as we properly account for 
the residual. This allows us to sub-sample at extremely low rates, generally < 1%. The contribution 
of our work is to significantly speed up permutation testing in neuroimaging, delivering running time 
improvements of up to 50 x . In other words, our algorithm does the same job as permutation testing, 
but takes anywhere from a few minutes up to a few hours, rather than days or weeks. Further, based 
on recent work in random matrix theory, we provide an analysis which sheds additional light on 
the use of matrix completion methods in this context. To ensure that our conclusions are not an 
artifact of a specific dataset, we present strong empirical evidence via evaluations on four separate 
neuroimaging datasets of Alzheimer’s disease (AD) and Mild Cognitive Impairment (MCI) patients 
as well as cognitively healthy age-matched controls (CN), showing that the proposed method can 
recover highly faithful Global Null distributions, while offering substantial speedups. 
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2 The Proposed Algorithm 


We first cover some basic concepts underlying permutation testing and low rank matrix completion 
in more detail, before presenting our algorithm and the associated analysis. 

2.1 Permutation testing 

Randomly sampled permutation testing 1T81 is a methodology for drawing samples under the Global 
(Family-Wise) Null hypothesis. Recall that although point-wise test statistics have well character¬ 
ized univariate Null distributions, the sample maximum usually has no analytic form due to the 
strong correlations across voxels. Permutation is particularly desirable in this setting because it is 
free of any distribution assumption whatsoever CEO- The basic idea of permutation testing is very 
simple, yet extremely powerful. Suppose we have a set of labeled high dimensional data points, 
and a univariate test statistic which measures some interaction between labeled groups for every 
dimension (or feature). If we randomly permute the labels and recalculate each test statistic, then 
by construction we get a sample from the Global Null distribution. The maximum over all of these 
statistics for every permutation sample is then used to construct a histogram, which therefore is a 
non-parametric estimate of the distribution of the sample maximum of Null statistics. For a test 
statistic derived from the real labels, the FWER corrected p-value is then equal to the fraction of 
permutation samples which were more extreme. Note that all of the permutation samples can be 
assembled into a matrix P £ R vxT where v is the number of comparisons (voxels for images), and 
T is the number of permutation samples. 

There is a drawback to this approach, however. Observe that it is in the nature of random sampling 
methods that we get many samples from near the mode(s) of the distribution of interest, but fewer 
from the tails. Hence, to characterize the threshold for a small portion of the tail of this distribution, 
we must draw a very large number of samples just so that the estimate converges. Thus, if we want 
ana = 0.01 threshold from the Null sample maximum distribution, we require many thousands of 
permutation samples — each requires randomizing the labels and recalculating all test statistics, a 
very computationally expensive procedure when v is large. To be certain, we would like to ensure 
an especially low FWER by first setting a very low, and then getting a very precise estimate of the 
corresponding threshold. The smallest possible p-value we can derive this way is 1/T, so for very 
low p-values, T must be very large. 

2.2 Low-rank Matrix completion 

Low-rank matrix completion ED seeks to reconstruct missing entries from a matrix, given only a 
small fraction of its entries. The problem is ill-posed unless we assume this matrix has a low-rank 
column space. If so, then a much smaller number of observations, on the order of r log(u), where 
r is the column space’s rank, and v is its ambient dimension OH is sufficient to recover both an 
orthogonal basis for the row space as well as the expansion coefficients for each column, giving the 
recovery. By placing an ^-norm penalty on the eigenvalues of the recovered matrix via the nuclear 
norm GOl [21 1 we can ensure that the solution is as low rank as possible. Alternatively, we can 
specify a rank r ahead of time, and estimate an orthogonal basis of that rank by following a gradient 
along the Grassmannian manifold l22llT7i l. Denoting the set of randomly subsampled entries as G, 
the matrix completion problem is given as, 

min ||Pn — Pq||f s.t. P = UW; U is orthogonal (1) 

p 

where U £ R , ’ xr is the low-rank basis of P, G gives the measured entries, and W is the set of 
expansion coefficients which reconstructs P in U. Two recent methods operate in an online setting, 
i.e., where rows of P arrive one at a time, and both U and W are updated accordingly H22; :17j. 

2.3 Low rank plus a long tail 

Real-world data often have a dominant low-rank component. While the data may not be exactly 
characterized by a low-rank basis, the residual will not significantly alter the eigen-spectrum of the 
sample covariance in such cases. Having strong correlations is nearly synonymous with having a 
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skewed eigen-spectmm, because the flatter the eigen-spectrum becomes, the sparser the resulting 
covariance matrix tends to be (the “uncertainty principle” between low-rank and sparse matrices 
123)). This low-rank structure carries through for purely linear statistics (such as sample means). 
However, non-linearities in the test statistic calculation, e.g., normalizing by pooled variances, will 
contribute a long tail of eigenvalues, and so we require that this long tail will either decay rapidly, 
or that it does not overlap with the dominant eigenvalues. For (-statistics, the pooled variances are 
unlikely to change very much from one permutation sample to another (barring outliers) — hence 
we expect that the spectrum of P will resemble that of the data covariance, with the addition of a 
long, exponentially decaying tail. More generally, if the non-linearity does not de-correlate the test 
statistics too much, it will preserve the low-rank structure. 

If this long tail is indeed dominated by the low-rank structure, then its contribution to P can be 
modeled as a low variance Gaussian i.i.d. residual. A Central Limit argument appeals to the number 
of independent eigenfunctions that contribute to this residual, and, the orthogonality of eigenfunc¬ 
tions implies that as more of them meaningfully contribute to each entry in the residual, the more 
independent those entries become. In other words, if this long tail begins at a low magnitude and 
decays slowly, then we can treat it as a Gaussian i.i.d. residual; and if it decays rapidly, then the 
residual will perhaps be less Gaussian, but also more negligible. Thus, our development in the next 
section makes no direct assumption about these eigenvalues themselves, but rather that the residual 
corresponds to a low-variance i.i.d. Gaussian random matrix — its contribution to the covariance of 
test statistics will be Wishart distributed, and from that we can characterize its eigenvalues. 

2.4 Our Method 

It still remains to model the residual numerically. By sub-sampling we can reconstruct the low-rank 
portion of P via matrix completion, but in order to obtain the desired sample maximum distribu¬ 
tion we must also recover the residual. Exact recovery of the residual is essentially impossible; 
fortunately, for our purposes we need only need its effect on the distribution of the maximum per 
permutation test. So, we estimate its variance, (its mean is zero by assumption,) and then randomly 
sample from that distribution to recover the unobserved remainder of the matrix. 

A large component in the running time of online subspace tracking algorithms is spent in updat¬ 
ing the basis set U; yet, once a good estimate for U has been found this becomes superfluous. 
We therefore divide the entire process into two steps: training, and recovery. During the training 
phase we conduct a small number of fully sampled permutation tests (100 permutations in our ex¬ 
periments). From these permutation tests, we estimate U using sub-sampled matrix completion 
methods EH El, making multiple passes over the training set (with fixed sub-sampling rate), until 
convergence. In our evaluations, three passes sufficed. Then, we obtain a distribution of the residual 
S over the entire training set. Next is the recovery phase, in which we sub-sample a small fraction 
of the entries of each successive column t, solve for the reconstruction coefficients W(•,() in the 
basis U by least-squares, and then add random residuals using parameters estimated during training. 
After that, we proceed exactly as in a normal permutation testing, to recover the statistics. 

Bias-Variance tradeoff. By using a very sparse subsampling method, there is a bias-variance 
dilemma in estimating S. That is, if we use the entire matrix P to estimate U, W and S, we 
will obtain reliable estimates of S. But, there is an overfitting problem: the least-squares objective 
used in fitting W(•, f) to such a small sample of entries is likely to grossly underestimate the vari¬ 
ance of S compared to where we use the entire matrix; (the sub-sampling problem is not nearly as 
over-constrained as for the whole matrix). This sampling artifact reduces the apparent variance of S, 
and induces a bias in the distribution of the sample maximum, because extreme values are found less 
frequently. This sampling artifact has the effect of ‘shifting’ the distribution of the sample maximum 
towards 0. We correct for this bias by estimating the amount of the shift during the training phase, 
and then shifting the recovered sample max distribution by this estimated amount. 


3 Analysis 

We now discuss two results which show that as long as the variance of the residual is below a certain 
level, we can recover the distribution of the sample maximum. Recall from ([T} that for low-rank 
matrix completion methods to be applied we must assume that the permutation matrix P can be 


4 


decomposed into a low-rank component plus a high-rank residual matrix S: 

P = uw + S, (2) 

where U is a v x r orthogonal matrix that spans the r <C min(i>, t) -dimensional column subspace 
of P, and W is the corresponding coefficient matrix. We can then treat the residual S as a random 
matrix whose entries are i.i.d. zero-mean Gaussian with variance a 2 . We arrive at our first result 
by analyzing how the low-rank portion of P’s singular spectrum interlaces with the contribution 
coming from the residual by treating P as a low-rank perturbation of a random matrix. If this 
low-rank perturbation is sufficient to dominate the eigenvalues of the random matrix, then P can 
be recovered with high fidelity at a low sampling rate G2H3- Consequently, we can estimate the 
distribution of the maximum as well, as shown by our second result. 


The following development relies on the observation that the eigenvalues of PP r are the squared 
singular values of P. Thus, rather than analyzing the singular value spectrum of P directly, we can 
analyze the eigenvalues of PP 7 using a recent result from ll24ll . This is important because in order 
to ensure recovery of P, we require that its singular value spectrum will approximately retain the 
shape of UW’s. More precisely, we require that for some 0 < <5 < 1, 

\fa - fa\ < Sfa i = 1 , ... ,r; fa < Sfa i = r + (3) 


where (f>i and fa are the singular values of UW and P respectively. (Recall that in this analysis P 
is the perturbation of UW.) Thm. 3.1 relates the rate at which eigenvalues are perturbed, 5, to the 
parameterization of S in terms of cH. The theorem’s principal assumption also relates a 2 inversely 
with the number of columns of P, which is just the number of trials t. Note however that the process 
may be split up between several matrices P, , and the results can then be combined. For purposes 
of applying this result in practice we may then choose a number of columns t which gives the best 
bound. Theorem 3.1 also assumes that the number of trials t is greater than the number of voxels 
v, which is a difficult regime to explore empirically. Thus, our numerical evaluations cover the case 
where t < v, while Thm |3.1| covers the case where t is larger. 


From the definition of P in |2]), we have, 

PP T = UWW T U T + SS T + UWS T + SW T U T . (4) 

We first analyze the change in eigenvalue structure of SS T when perturbed by UWW T U T , (which 
has r non-zero eigenvalues). The influence of the cross-terms (UWS 7 and SW T U T ) is addressed 
later. Thus, we have the following theorem. 


Theorem 3.1. Denote that r non-zero eigenvalues of Q = UWW T U T £ R ,;x v by X\ > X -2 > 
,..., X r >0; and let S be a v x t random matrix such that S ij ~ A/”(0, a 2 ), with unknown a 2 . As 
v, t —> oo such that j -C 1, the eigenvalues X j of the perturbed matrix Q + SS 7 will satisfy 

|Ai — Ai| < <5Ai i = l,...,r; Xi < 6X r i = r + (*) 

for some 0 < 5 < 1 , whenever a 2 < 


Proof. The first half of the proof emulates Theorem 2.1 from l24l . Consider the matrix X = \/IS. 
By the structure of S, each entry of X is i.i.d.Gaussian with zero-mean and variance a 2 t. Let 
Y = iXX T and denote its ordered eigenvalues as o,. i = 1,..., v (large to small). Consider the 
random spectral measure 


Hv(A) = i#{ 7i £ A}, AcR 

The Marchenko-Pastur law [?] states that as v, t oo such that f < 1 , the random measure 
p v —> p, where dp is given by 

d ^ a ) = 2wX~,a V( r r+ ~a){a- 7-)l [ 7 -,7 + ] rfa 

where 7 = fa Here lr 7 _ i7+ ] is an indicator function that is non-zero on [ 7 _, 7 + ], 7 ± = a 2 t{ 1 ± 
fao;) 2 are the extreme points of the support of p. It is well known that the extreme eigenvalues 
converge almost surely to [?]. Since v,t —> 00 and 7 = | <1, the length of [ 7 _, 7 + ] is much 
smaller than the values in it. Hence we have. 
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7 ± ~ CT 2 i(l ± 2 ^/ 7 ) ; v /( 7 +-a)(a- 7 _) < a 

and the new dp{a) is given by 

J N _ y/(ff 2 t(l + 2 ^ 7 ) - a)(a - er 2 t(l - 2 ,/y)) ^ 

“M a ) - -27T7cr 4 t 2 - 1 k 2 t(l-2 % /7),<T2t(l+2 N /7)]“ a 

= ‘2'IX'yO 4 i 2 — ( a — 0" 2 i : )“l[ a .2j(i_2 % /7),cr 2 t(l+2 % /7)] , ^ a ' 

The form we have derived for dp(a ) shares some similarities with dpx(x) in Section 3.1 of ll24l . 
The analysis in l24l takes into account the phase transition of extreme eigen values. This is done by 
imitating a time-frequency type analysis on compact support of extreme spectral measure i.e. using 
Cauchy transform. For our case, the Cauchy transform of p(a) is 

G> {z) = 2 ^2 ( z ~ a2t ~ s gn(z)y/{z - cr 2 f) 2 - 47cr 4 f 2 ) 

for 2 G ( 00 , d 2 f(l — 2 ^/ 7 )) U (cr 2 t(l + 2y/j), 00 ) 

Since we are interested in the asymptotic eigen values (and 7 <C 1), G /t (77 j and the functional 
inverse G~ 1 (9) are 


7 +)= 


ty/ll) 


G»{ 7 -) = - 




G-\9) = aH + l+'yaH^ 


Hence, the asymptotic behavior of the eigen values of perturbed matrix Q + SS T is (observing that 
SS T = Y and Q has r non-zero positive eigen values) 

\i(i = 1„ .. ,r) =» J A < + " 2 *+ I $ : f “ (,) 

I 7 <r"f else 

Ai(i = r + 1 ,.. . ,v) « cr 2 f(l - 2 ^ 7 )) 

With Aj, i = 1, ..., v in hand, we now bound the unknown variance a 2 such that (*) is satisfied. We 
only have two cases to consider. 


(1) A i > 7 a 2 t , i = 1,.. ., r (2) A j < 7 a 2 t , i = k, ... ,r (for some k > 1) 


We constrain the unknown a 2 such that case (2) does not arise. Substituting for Xfs from (*) in (*), 
we get. 


(j 2 t + < SXi ; A i > r ycr 2 t ; er 2 f(l — < SX r 

These inequalities will hold when a 2 < (since 7 =<C 1, 5 < 1 and Ai > A 2 >, ..., A r ). □ 

Note that the missing cross-terms would not change the result of Theorem [37] drastic ally, because 
UW has r non-zero singular values and hence UWS T is a low-rank projection of a low-variance 
random matrix, and this will clearly be dominated by either of the other terms. Having justified 
the model in Q, the following thorem shows that the empirical distribution of the maximum Null 
statistic approximates the true distribution. 


Theorem 3.2. Let m t = max, P i t be the maximum observed test statistic at permutation trial 
t, and similarly let rht = max^ P;. t be the maximum reconstructed test statistic. Further, let the 
maximum reconstruction error be e, such that |Pj )t — P,.t| < e. Then, for any real number k > 0, 
we have, 


Pr 



{b—b) > ke 


< 


1 

fc 2 


where b is the bias term described in Section 2, and b is its estimate from the training phase. 
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Proof. Recall that there is a bias term in estimating the distribution of the maximum which must be 
corrected for this is because var(S') underestimates var(S') due to the bias/variance tradeoff. Let b 
be this difference: 

b = E t max Pi, t — E 4 maxP, t . 

Li ’ J Li ’ . 

Further, recall that we estimate b by taking the difference of mean sample maxima between observed 
and reconstructed test statistics over the training set, giving 6 , which is an unbiased estimator of b 
— it is unbiased because a difference in sample means is an unbiased estimator of the difference of 
two expectations. 

Let St = mt — rht • To show the result we must derive a concentration bound on S t , which we will 
do by applying Chebyshev’s inequality. In order to do so, we require an expression for the mean 
and variance of S t . First, we derive an expression for the mean. Taking the expectation over t of 
rri( — ifit we have, 

E t [mt - rh t \ = E* maxP i)t - maxP iit - b 
L i i 

= E t maxPj )t — E t maxP^ —6 
Li ’ J Li ’ . 

= b-b 

where the second equality follows from the linearity of expectation. 

Next, we require an expression for the variance of 5 t . Let i be the index at which the maximum 
observed test statistic occurs for permutation trial t, and likewise let j be the index at which the 
maximum reconstructed test statistic occurs. Thus we have, 

Pi,t < Pi,t + e < P j.t + e 

P i.t f P> P j.t — e, 


and so we have that 
and so 


| m t -hi t | < 2 e 
var (m t — rht) < e 2 - 


Applying Chebyshev’s bound, 

Pr m t —m t — (b—b)>ke < -L 

-I rC 

which completes the proof. 


□ 


4 Experimental evaluations 

Our experimental evaluations include four separate neuroimaging datasets of Alzheimer’s Disease 
(AD) patients, cognitively healthy age-matched controls (CN), and in some cases Mild Cognitive 
Impairment (MCI) patients. The first of these is the Alzheimer’s Disease Neuroimaging Initiative 
(ADNI) dataset, a nation-wide multi-site study. ADNI is a landmark study sponsored by the NIH, 
major pharmaceuticals and others to determine the extent to which multimodal brain imaging can 
help predict onset, and monitor progression of AD. The others were collected as part of other studies 
of AD and MCI. We refer to these datasets as Dataset A—D. Their demographic characteristics are 
as follows: Dataset A: 40 subjects, AD vs. CN, median age : 76; Dataset B: 50 subjects, AD vs. 
CN, median age : 68 ; Dataset C: 55 subjects, CN vs. MCI, median age : 65.16; Dataset D: 70 
subjects, CN vs. MCI, median age : 66.24. 

Our evaluations focus on three main questions: (!) Can we recover an acceptable approximation 
of the maximum statistic Null distribution from an approximation of the permutation test matrix? 
(ii) What degree of computational speedup can we expect at various subsampling rates, and how 
does this affect the trade-off with approximation error? (iii) How sensitive is the estimated a-lcvel 
threshold with respect to the recovered Null distribution? In all our experiments, the rank estimate 
for subspace tracking (to construct the low-rank basis U) was taken as the number of subjects. 
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4.1 Can we recover the Maximum Null? 


Our experiments suggest that our model can recover the maximum Null. We use Kullback-Leibler 
(KL) divergence and Bhattacharya Distance (BD) to compare the estimated maximum Null from our 
model to the true one. We also construct a “Naive-Null”, where the subsampled statistics are pooled 
and the Null distribution is constructed with no further processing (i.e., completion). Using this as 
a baseline. Fig. [5] shows the KL and BD values obtained from three datasets, at 20 different sub¬ 
sampling rates (ranging from 0.1% to 10%). Note that our model involves a training module where 
the approximate ‘bias’ of residuals is estimated. This estimation is prone to noise (for example, 
number of training frames). Hence Fig. [5] also shows the error bars pertaining to 5 realizations 
on the 20 sampling rates. The first observation from Fig. [5] is that both KL and BD measures of 
the recovered Null to the true distribution are < e -5 for sampling rates more than 0.4%. This 





(a) Dataset A (b ) Dataset B (c) Dataset C 

Figure 1: KL (blue) and BD (red) measures between the true max Null distribution (given by the full matrix 
P) and that recovered by our method (thick lines), along with the baseline naive subsampling method (dotted 
lines). Results for Datasets A. B, C are shown here. Plot for Dataset D is in the extended version of the paper. 


suggests that our model recovers both the shape (low BD) and position (low KL) of the null to high 
accuracy at extremely low sub-sampling. We also see that above a certain minimum subsampling 
rate (~ 0.3%), the KL and BD do not change drastically as the rate is increased. This is expected 
from the theory on matrix completion where after observing a minimum number of data samples, 
adding in new samples does not substantially increase information content. Further, the error bars 
(although very small in magnitude) of both KL and BD show that the recovery is noisy. We believe 
this is due to the approximate estimate of bias from training module. 



Figure 2: Scatter plot of computational speedup vs. KL. The plot corresponds to the 20 different samplings 
on all 4 datasets (for 5 repeated set of experiments) and the colormap is from 0.1% to 10% sampling rate. The 
x-axis is in log scale. 
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Figure 3: Computation time (in minutes) of our model compared to that of computing the entire matrix P. 
Results are for the same three datasets as in Fig. [5] Please find the plot for Dataset D in the extended version of 
the paper. The horizontal line (magenta) shows the time taken for computing the full matrix P. The other three 
curves include : subsampling (blue), GRASTA recovery (red) and total time taken by our model (black). Plots 
correspond to the low sampling regime (< 1%) and note the jump in y-axis (black boxes). For reference, the 
speedup factor at 0.4% sampling rate is reported at the bottom of each plot. 




(a) Datasets A, B (b) Datasets C, D 

Figure 4: Error of estimated t statistic thresholds (red) for the 20 different subsampling rates on the four 
Datasets. The confidence level is 1 — a = 0.95. The y-axis is in log-scale. For reference, the thresholds given 
by baseline model (blue) are included. Note that each plot corresponds to two datasets. 


4.2 What is the computational speedup? 

Our experiments suggest that the speedup is substantial. Figs. [6] and [2] compare the time taken to 
perform the complete permutation testing to that of our model. The three plots in Fig. [^correspond 
to the datasets used in Fig. [5j in that order. Each plot contains 4 curves and represent the time 
taken by our model, the corresponding sampling and GRASTA ed recovery (plus training) times 
and the total time to construct the entire matrix P (horizontal line). And Fig. [2] shows the scatter 
plot of computational speedup vs. KL divergence (over 3 repeated set of experiments on all the 
datasets and sampling rates). Our model achieved at least 30 times decrease in computation time in 
the low sampling regime (< 1%). Around 0.5% — 0.6% sub-sampling (where the KL and BD are 
already < e -5 ), the computation speed-up factor averaged over all datasets was 45 x. This shows 
that our model achieved good accuracy (low KL and BD) together with high computational speed 
up in tandem, especially, for 0.4% — 0.7% sampling rates. However note from Fig. [2] that there 
is a trade-off between the speedup factor and approximation error (KL or BD). Overall the highest 
computational speedup factor achieved at a recovery level of e~ 5 on KL and BD is around 50x (and 
this occured around 0.4% — 0.5% sampling rate, refer to Fig. [2|. It was observed that a speedup 
factor of upto 55 x was obtained for Datasets C and D at 0.3% subsampling, where the KL and BD 
were as low as e~ 5 5 (refer to Fig. [5]and the extended version of the paper). 

4.3 How stable is the estimated a-threshold (clinical significance)? 

Our experiments suggest that the threshold is stable. Fig. [4] and Table [I] summarize the clinical 
significance of our model. Fig. [4] show the error in estimating the true max threshold, at 1 — a = 
0.95 level of confidence. The x-axis corresponds to the 20 different sampling rates used and y- 
axis shows the absolute difference of thresholds in log scale. Observe that for sampling rates 
higher than 3%, the mean and maximum differences was 0.04 and 0.18. Note that the binning 
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Data 

name 

Sampling 

rate 

0.95 

1 - 
0.99 

a level 
0.995 

0.999 

A 

0.3% 

0.5% 

0.16 

0.13 

0.11 

0.08 

0.14 

0.10 

0.07 

0.03 

B 

0.3% 

0.5% 

0.02 

0.02 

0.05 

0.07 

0.03 

0.08 

0.13 

0.04 

C 

o% 

0.5% 

0.04 

0.01 

0.13 

0.07 

0.21 

0.07 

0.20 

0.05 

D 

0.3% 

0.5% 

0.08 

0.12 

0.10 

0.13 

0.27 

0.25 

0.31 

0.22 


Table 1: Errors of estimated t statistic thresholds on all datasets at two different subsampling rates. 


resolution of max.statistic used for constructing the Null was 0.01. These results show that not only 
the global shape of the maximum Null distribution is estimated to high accuracy (see Section |4~Tj ) 
but also the shape and area in the tail. To support this observation, we show the absolute differences 
of the estimated thresholds on all the datasets at 4 different a levels in Table Q] The errors for 
1 — a = 0.95, 0.99 are at most 0.16. The increase in error for 1 — a > 0.995 is a sampling artifact 
and is expected. Note that in a few cases, the error at 0.5% is slightly higher than that at 0.3% 
suggesting that the recovery is noisy (see Sec. 

Q-thresholds are both faithful and stable. 

5 Conclusions and future directions 

In this paper, we have proposed a novel method of efficiently approximating the permutation testing 
matrix by first estimating the major singular vectors, then filling in the missing values via matrix 
completion, and finally estimating the distribution of residual values. Experiments on four different 
neuroimaging datasets show that we can recover the distribution of the maximum Null statistic to a 
high degree of accuracy, while maintaining a computational speedup factor of roughly 50 x. While 
our focus has been on neuroimaging problems, we note that multiple testing and False Discovery 
Rate (FDR) correction are important issues in genomic and RNA analyses, and our contribution may 
offer enhanced leverage to existing methodologies which use permutation testing in these settings (6). 
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4.1 and the errorbars of Fig. p). Overall the estimated 
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Fig 1. : Ail 4 datasets 




Subsampling rate (%) Subsampling rate (%) 

(c) Dataset C (d) Dataset D 

Figure 5: KL (blue) and BD (red) measures between the true max. null distribution (given by the full matrix 
P) and that recovered by our method (thick lines), along with the baseline naive subsampling method (dotted 
lines). Each plot corresponds to one of the four datasets used in our evaluations. Note that the y-axis is in log 
scale. 


Fig 2. : All 4 datasets 
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